* Author: Alexander Moradi

* This do file creates the All-Africa data set (cross section)
* 1. Merge data at gridcell level
* 2. Merge data at ethnic group level (from Murdock)
* 3. Distance calculations (distance to national cities, slave ports, Cities in 1400 and 1800
* 4. Cash crop value in 1900 and 1924

* The do file uses the geodist package
ssc install geodist

* Set Directory path using cd

do "Grid Africa\compile_grid_mines.do" // mining location and data of opening

do "Grid Africa\compile_grid_missions.do" // mission data

do "Grid Africa\compile_grid_pop.do" // HYDE 3.1


********************************
*****Data sets used to compile All Africa grid data:

** - grid_pop // population 1400-1800 from HYDE 3.1

** - grid_popurbrur // urban and rural population 1900 from from HYDE 3.0
** - grid_pop2000 // urban population 1900-2010

** - grid_muslim // distance to a muslim seat/ centre, distance to muslim belt, in muslim belt

** - grid_murdock // Ethnic group level data for polygamy monogamy slavery polity missing_murdock v99"
** - grid_malaria  // mean of sickle cell map
** - grid_tsetse // mmean of tsetse from map Alsan (2015)

** - grid_mining // number of mines existing 1900, 1914 & 1924. Source: SNL &  Remi based on USGS
** - grid_climate // Rainfall and Temperature 1900-1929 from CPRU and GPCC respectively - For 2.5° grid resolution: Schneider, Udo; Becker, Andreas; Finger, Peter; Meyer-Christoffer, Anja; Rudolf, Bruno; Ziese, Markus (2011): GPCC Full Data Reanalysis Version 6.0 at 2.5°: Monthly Land-Surface Precipitation from Rain-Gauges built on GTS-based and Historic Data. DOI: 10.5676/DWD_GPCC/FD_M_V7_250
** - grid_altitude // based on 250mx250m raster resolution
** - grid_dist2coast // Distance to coast 

** - grid_crops // crop suitatability from GAEZ, index 0-10000

** - grid_dist2rail // Distance to railroad
** - grid_dist2placeborail // Distance to placebo routes
** - grid_dist2lake // Distance to lakes in km
** - grid_dist2river /// Distance to navigable river in km
** - grid_dist2explorer // Distance to Nunn's explorer routes in km

** - grid_nightlights // Version 4 DMSP-OLS Nighttime Lights Time Series for the years 1992/93, 2000/01, 2002/03 and 2012/2013 (averages of mean value)

** - grid_gridcell2 // Gridcell ID aggregating 0.1x01 cells into 0.2x0.2 cells
** - grid_soil // soil suitable for cultivation from "Map 6.65: Combined suitability of currently available land for pasture and rainfed crops (low input level)" includes land well-suited for rainfed crops & prime land for rainfed crops + irrigaged areas
** - grid_soil5 // soil fertility index Africa map %afferl, 5 categories (0-20, 20-40, 40-60, 60-80, 80-100)
** - grid_area // share of land that is not covered by water (lake/ocean)

** - col_date: Date of colonisation of ethnic group from Henderson & Whatley - based on Thomas Pakenham's "The Scramble for Africa." 

** Used original raster and Ordinary Kriging method (spherical) and interpolate to full extent
* precipitation & temperature (grid_climate)
* sickle cell malaria (grid_malaria)
* tsetse (grid_tsetse)
* cash crops - there are just a few cells along the coasts (but that's also where a disproportionate number of mission are)


*********************************************************************************
*** 1. Merge data at gridcell level

use "Grid Africa\Files\grid_gridcell.dta", replace

** 1a) Merge data sets
foreach i in grid_climate grid_altitude grid_pop grid_popurbrur grid_pop2000 grid_muslim grid_malaria grid_tsetse grid_mining grid_crops grid_soil grid_soil5 grid_area grid_dist2rail grid_dist2placeborail grid_dist2lake grid_dist2explorer grid_dist2river grid_dist2coast  {
merge 1:1 gridcell using "Grid Africa/Files/`i'.dta", nogenerate
}

*** 1b) Housekeeping ***

** Missing values in HYDE's urbc and rurc replaced with 0 (so to not add population, classic measurement error)
replace urbc=0 if urbc==.
replace rurc=0 if rurc==.
gen lnrurc=ln(1+rurc)
replace lnrurc=0 if rurc==.
gen lnurbc=ln(1+urbc)
replace lnurbc=0 if urbc==.
gen rurc0=(rurc==0)
gen urbc0=(urbc==0)
replace count_hyde=0 if count_hyde ==.

label var rurc0 "Cell has 0 rural population in 1900 (HYDE 3.0)"
label var urbc0 "Cell has 0 urban population in 1900 (HYDE 3.0)"
label var lnrurc "ln(1+rurc1900)"
label var lnurbc "ln(1+urbc1900)"
order rurc0 urbc0 lnrurc lnurbc, after (rurc1900)

drop srurc surbc // these are the sum of rural/urban population in 1900 (HYDE 3.0)

** Merge with grid_country // there is one duplicate which is the only case where a Beach/Roome mission station are in the same grid but in 2 different countries - Rwanda and DRC
merge m:m gridcell using "Grid Africa/Files/grid_country.dta", nogenerate force

* Drop cells in the sea
* scatter centroidy centroidx if land==. & lnrurc>0 // missing land values are i) coastal (in the sea), ii) in the Sahara, iii) Cap Verde
drop if land==. | country==""
drop land

** Altitude data only 212,207 cells
replace alt_mean=1 if alt_mean==. // not overlaid by elevation map - coastal grids only, set to 1
replace area=0.000004 // not overlaid by elevantation map - coastal grids only, set to lowest value in the dataset
replace alt_std=0 if alt_std==. // For 779 grid cells std_alt=0, which creates missing values when taking the log => replace missing values in std_alt with lowest value of ln(std_alt)
gen lnalt_std=ln(alt_std)
egen min=min(lnalt_std)
replace lnalt_std=min if lnalt_std==. // missing almost entirely for coastal gridcells
replace alt_mean=2 if alt_mean==. // missing almost entirely for coastal gridcells
drop min

** Missing values in mining data is 0
foreach i of varlist mines1924_50km mines1914_50km mines1900_50km mines1924_20km mines1914_20km mines1900_20km n_mines1924_50km n_mines1900_50km n_mines1924_20km n_mines1900_20km {
replace  `i'=0 if `i'==.
}

* Distance to navigable River, lakes & Explorer route
gen dist2river_nav10km=(dist2river_nav<=10)
gen dist2explorer_nunn10km=(dist2explorer<=10)
gen dist2lake10km=(dist2lake<=10)
gen dist2rail1900_10km=(dist2rail1900<=10)
gen dist2rail1924_10km=(dist2rail1924<=10)

label var dist2river_nav10km "Grid within 10 km of navigable river"
label var dist2explorer_nunn10km "Grid within 10 km of explorer route (Source: Nunn)"
label var dist2lake10km "Grid within 10 km of inland lake"

** Distance to coast & muslimbelt
gen lndist2muslimbelt=ln(dist2muslimbelt+1)
gen lndist2muslimcentre=ln(dist2muslimcentre+1)
gen lndist2coast=ln(dist2coast+1)


*********************************************************************************
* 1c) Merge with mission data
merge 1:1 gridcell country using "Grid Africa/Files/grid_missions.dta", nogenerate force // one gridcell divided between two countries with a mission station each, treated as two separate observations

** Code Extensive margin of mission stations
gen missionnunn_dummy=(protest_nunn+catholic_nunn>=1)
label var missionnunn_dummy "Mission in the cell (Roome map)"
gen missioncr_dummy=(protestant_cr>=1)
label var missioncr_dummy "Mission in the cell (Cage & Rueda map)"
gen n_missionnunn=protest_nunn+catholic_nunn
label var n_missionnunn "Number of Catholic and Protestant missions in the cell (Roome map)"
*********************************************************************************

drop col_name row_name bfbs_nunn

*********************************************************************************



*********************************************************************************
* 2. Merge data at ethnic group level (from Murdock)

*** 2a) Merge data sets ***

** Murdock variables
merge m:m ethnic using "Grid Africa/Files/grid_murdock.dta" // ethnic is murdock_name
drop if country==""
// Ethnic groups are assigned to gridcells. Issue if ethnic group close to country border. It then can be assigned to the wrong country
replace country="Burkina Faso" if country=="Ghana" & ethnic=="LOBI"
replace country="Cameroon" if country=="Nigeria" & ethnic=="PODOKWO"
replace country="DRC" if country=="Congo" & ethnic=="EKONDA"
replace country="Niger" if country=="Burkina Faso" & ethnic=="ZERMA"
replace country="Niger" if country=="Nigeria" & ethnic=="KURFEI"
replace country="Niger" if country=="Nigeria" & ethnic=="MAURI"
replace country="Sudan" if country=="DRC" & ethnic=="MORU"
replace country="Uganda" if country=="Kenya" & ethnic=="SOGA"
drop _merge

** Nunn's slave exports
merge m:m ethnic using "Grid Africa/Files/tribe_level_slave_exports_Atlantic_Indian.dta" // ethnic is murdock_name
replace ln_export_area=0 if ln_export_area==. // Nunn excluded uninhabitated places: UNINHABITED (West Saharan Desert), UNINHABITED (Libyan Desert), UNINHABITED (Madeira)
drop if _merge==2 // Checked: These are ethnic groups in North Africa (except for Comorians - Comoros Islands are not included in the dataset), 
drop _merge

** Date of colonisation
merge m:m ethnic country using "Grid Africa/Files/date_col.dta" // ethnic is murdock_name, country names adjusted to allow for match
* Issue: gridcells do not allow for the same level of resolution as polygons. This causes a mismatch between the 2 data sets 
drop if _merge==2 // Ignore _merge==2: ethnicity-country coding, if there are no gridcells with that ethnicity-country coding (this happens when the grid centroid is outside the ethnicity-country observation, but the polygon is not
drop _merge
gen nevercolonised=(date_col==.)
label var nevercolonised "Country never colonised (Ethiopia & LIberia)"
replace date_col=0 if date_col==.

* Ethnic group has access to coast
merge m:m ethnic using "Grid Africa/Files/coastal.dta" // ethnic is murdock_name
replace coastal=0 if coastal==.
label var coastal "Ethnic groups has access to coast"
drop if _merge==2 // Checked: These are ethnic groups in North Africa (except for Comorians - Comoros Islands are not included in the dataset)
drop _merge



*** 2b) Housekeeping ***

* Generate ethnic group & country fixed effects
egen ethnicfe=group(ethnic)
egen countryfe=group(country)
egen districtfe=group(country region district2000)
********************************************************************************








************************************************
* 3. Distance calculations

**** 3a)  Distance to National Cities
tempfile grid
save "`grid'"


* Importing the city data set
import excel "Grid Africa\Files\cities1901_v7.xls", sheet("Feuil1") firstrow case(lower) clear
rename longitude city_longitude
rename latitude city_latitude
replace country="CAR" if country=="Central African Republic"
replace country="DRC" if country=="Zaire"
replace country="Congo" if country=="Congo-Brazzaville"
replace country="Guinea Bissau" if country=="Guinea-Bissau"
replace country="Cote d' Ivoire" if country=="Ivory Coast"

** Match with grid data
joinby country using `grid', unmatched(both)
geodist city_latitude city_longitude centroidy centroidx, gen(dist2city)
gen dist2nationalcity=dist2city if national==1
replace pop1900=pop_circa1900 if (pop1900==. | pop1900==0) & pop_circa1900!=.
gen pop1900_10=pop1900 if dist2city<=10 & pop1900>=10000
collapse (min) dist2nationalcity (sum) pop1900_10, by(gridcell country)
drop if gridcell==""
label var dist2nationalcity "Distance to closest national city (km)"
label var pop1900_10 "Total urban population (localities > 10,000 inh.) of the cell in 1900"

tempfile city
save "`city'"

* Merge with the grid data
use `grid',  clear
merge 1:1 gridcell country using `city', nogenerate

************************************************


************************************************
**** 3b) Distance to Slave ports
save "`grid'", replace

keep gridcell centroidx centroidy 
tempfile grid_reduced
save "`grid_reduced'"

* Importing the slave port data set
import excel "Grid Africa\Files\Slave port.xlsx", sheet("Sheet") firstrow case(lower) clear
rename longitude port_longitude
rename latitude port_latitude

drop if exp1801_1900==0 // only use slave ports in use 1801-1900
keep port* 

cross using `grid_reduced'
geodist  port_latitude port_longitude centroidy centroidx, gen(dist2slaveport)
collapse (min) dist2slaveport, by(gridcell)
label var dist2slaveport "Distance to closest slave port 1801-1900 (km)"

tempfile port
save "`port'"

* Merge with the grid data set
use `grid',  clear
merge m:1 gridcell using `port'
drop _merge

gen dist2slaveport_10km=(dist2slaveport<=10)
label var dist2slaveport_10km "Grid within 10 km of a slave port 1801-1900"

gen lndist2slaveport=ln(dist2slaveport)
egen min=min(lndist2slaveport)
replace lndist2slaveport=min if lndist2slaveport==. & dist2slaveport!=.

************************************************


************************************************
*** 3c) Distance to Cities in 1400 and 1800 from Chandler
save "`grid'", replace

keep gridcell centroidx centroidy 
tempfile grid_reduced
save "`grid_reduced'"

* Importing the slave port data set
import excel "Grid Africa\Files\Chandler_1400_1800_africa.xlsx", sheet("Sheet3") firstrow case(lower) clear

cross using `grid_reduced'
geodist  lat1400 lon1400 centroidy centroidx, gen(dist2city1400)
geodist  lat1800 lon1800 centroidy centroidx, gen(dist2city1800)

collapse (min) dist2city1400 dist2city1800, by(gridcell)
label var dist2city1400 "Distance to city 1400 (km)"
label var dist2city1800 "Distance to city 1800 (km)"

gen city1400=(dist2city1400<=10)
gen city1800=(dist2city1800<=10)
label var city1400 "City in 1400 (within 10km of gridcell centroid)"
label var city1800 "City in 1800 (within 10km of gridcell centroid)"

tempfile city
save "`city'"

* Merge with the grid data set
use `grid',  clear
merge m:1 gridcell using `city', nogenerate
*******************************************************************************





********************************************************************************
* 4. Cash crop value in 1900 and 1924
* * multiply gridcell soil quality with value weighted by the share of the gridcell index as a share of total of the country
* a) Calculate colony sum of GAEZ index (colony is the level for which we have export statistics). The colony  aggregation is shown by variable <col>.
* b) Calculate weight of gridcell by dividing GAEZ gridcell index by total GAEZ of all gridcells in a colony (-> better soils get higher weight, e.g. a cell of index=5000 would get double the weight as a cell of index=2500) 
* c) Getting the export statistics dataset into shape (cleaning colony names, calculate export value by cash crop and colony, reshape cash crop from row to column)
* d) merge with grid data set by colony!

** 0) Coding countries into colonies for which we have export statistics
gen col=country
// The French statistical abstract explicitly cites the following countries: République du Congo, Soudan française (Mali), Guinée française and Côte d'Ivoire. This in all likelihood also includes Burkina Faso and Niger, and prior to 1910 French Gabon and Chad. 
replace col="COA" if country=="Burkina"
replace col="COA" if country=="Guinea"
replace col="COA" if country=="Ivory Coast"
replace col="COA" if country=="Mali"
replace col="COA" if country=="Niger"
replace col="COA" if country=="Congo"
// Etablissements Francaises du Golfe de Guinee; Only available from 1890 to 1925 in the French trade statistics. Covers Togo and Dahomey.
// replace col="EFG" if country=="Togo" & year==1924 // export statistics for German Togoland available for 1899 and 1924
replace col="EFG" if country=="Benin" // values of Togo to be subtracted (for 1924 only?)

replace col="EAF" if country=="Central African Republic"
replace col="EAF" if country=="Chad"
replace col="EAF" if country=="Congo"
replace col="EAF" if country=="Gabon"

replace col="Gambia, The" if country=="Gambia"
replace col="Gold Coast" if country=="Ghana"
replace col="Madagascar" if country=="Madagascar"
replace col="Nigeria" if country=="Nigeria"
replace col="Northern Rhodesia" if country=="Zambia"
replace col="Nyasaland" if country=="Malawi"
replace col="Senegal" if country=="Senegal"
replace col="Sierra Leone" if country=="Sierra Leone"
replace col="Somaliland" if country=="Somalia"
replace col="Southern Rhodesia" if country=="Zimbabwe"
replace col="Tanganyika" if country=="Tanzania"

replace col="Ruanda-Urundi" if country=="Rwanda" | country=="Burundi"


* a) Calculate colony sum of GAEZ index (colony is the level for which we have export statistics)
sort col
foreach control_crops of varlist cocoa coffee cotton groundnut oilpalm tea tobacco {
by col: egen sumGAEZ`control_crops'=total(`control_crops')
label var sumGAEZ`control_crops' "Total GAEZ index `control_crops', summed over colony"
}

* b) Calculate weight of individual grid 
sort col
foreach control_crops of varlist cocoa coffee cotton groundnut oilpalm tea tobacco {
by col: gen weightGAEZ`control_crops'=`control_crops'/sumGAEZ`control_crops'
replace weightGAEZ`control_crops'=0 if weightGAEZ`control_crops'==. // for some colonies, there is no suitable soil (sum of GAEZ indices=0, hence gridcell drops due to division by 0
label var weightGAEZ`control_crops' "ratio GAEZ index grid / total of colony, `control_crops'"
}

tempfile grid
save "`grid'",  replace

** c) Getting the export statistics dataset into shape
do "Grid Africa\calculate_export_values.do" // this do file brings export statistics into shape merging data from different source


* d) Merge export statistics by colony and cash crop with grid dataset
use `grid', replace
merge m:1 col using "Grid Africa\Files\crops1924.dta"
drop if _merge==2
drop _merge

merge m:1 col using "Grid Africa\Files\crops1900.dta"
drop if _merge==2
drop _merge

merge m:1 country using "Grid Africa\Files\crops_post1961.dta"
drop if _merge==2
drop _merge

foreach yr of numlist 1900 1924 1961 2000 {
** Calculate grid export value & replace missing values in grid value to 0
foreach i of varlist cocoa coffee cotton groundnut oilpalm tea tobacco {
replace value`yr'`i'=0 if value`yr'`i'==.
gen gridvalue`yr'`i'=weightGAEZ`i'*value`yr'`i'
replace gridvalue`yr'`i'=0 if value`yr'`i'==.
label var gridvalue`yr'`i' "Export value of `i' in `yr' assigned to grid (by contribution of grid's GAEZ to colony's GAEZ)"
}

gen gridvalue`yr'=gridvalue`yr'cocoa+gridvalue`yr'coffee+gridvalue`yr'cotton+gridvalue`yr'groundnut+gridvalue`yr'oilpalm+gridvalue`yr'tea+gridvalue`yr'tobacco+1
gen lngridvalue`yr'=ln(gridvalue`yr')
label var gridvalue`yr' "Export values from all cash crops assigned to cell in `yr' (1900/1924 in £, 1960/2000 in 1000USD)"
label var lngridvalue`yr' "Ln[Export value of any cash crop in `yr']" // assigned to grid (by contribution of grid's GAEZ to colony's GAEZ)]
}
********************************************************************************


** Labeling variables
drop area land_area min

label var dist2rail1900_10km "Grid within 10 km of railroad line as of 1900"
label var dist2rail1924_10km "Grid within 10 km of railroad line as of 1924"
label var col "Colony"

label var gridcell "Gridcell ID 0.1 x 0.1"

order gridcell gridcell2 centroidx centroidy country region2000 district2000 ethnic missionnunn_dummy missioncr_dummy n_missionnunn catholic_nunn protest_nunn protestant_cr_original print1903_original alt_mean alt_std lnalt_std temp precip
order urbc1900 rurc1900 rurc0 urbc0 lnrurc lnurbc, after(spop1400)

save "Grid Africa\grid_africa_01182018.dta", replace
******************************************